Objective

The aim of this project is to use a sample body of text - extracted from blogs, twitter and the news - to create an algorithm that predicts the next word of a given input phrase. The ultimate goal is to implement this prediction algorithm in a Shiny app. As such, in the interests of usability, the memory requirements and runtime of the prediction algorithm will be key considerations during its development.

This report covers several areas of the predictive algorithm's development. First, the predictive algorithm methodology and setup are explained. Subsequently, the algorithm development process is described - this consists of the pre-processing carried out on the input texts or “corpus”, exploratory analysis, modelling, benchmark testing and initial output results. Finally, there is a discussion of how the predictive algorithm could be implemented in the Shiny app, and what potential improvements could be made.

How the Algorithms Work

In this report, two different algorithms are investigated:

  1. Markov Chain Model w/ Stupid Back-Off
  2. Katz Back-Off Model

To generate the training corpus (drawn from the news, blog and twitter sources) from which the algorithms extract predictions, the text is tokenised by word.

Both algorithms share the same initial setup process (steps 1-3) and subsequently diverge:

  1. Text from twitter, blogs and news is tokenised, pre-processed and converted into a frequency table of n-grams (uni-grams, bi-grams, tri-grams and quad-grams)
  2. Input text is parsed into word tokens and pre-processed via the same process as the text corpus
  3. The last “n” words of the input text are converted into an n-gram (“n” is an input argument to the algorithm, n <= 4) - this is the n-gram prefix of the word to be predicted, i.e. the n-gram prefix of the (n+1)-gram output

[Markov Chain Model - w/ Stupid Back-Off]

  1. The algorithm searches through the (n+1)-grams in the frequency table to find those whose n-gram prefix matches the input text
  2. If no matching (n+1)-grams are found, the order of the input n-gram is reduced by 1 (i.e. n = n-1) and step 1 is repeated until n = 1, otherwise:
    1. Matching (n+1)-grams are selected, ranked by frequency and output
  3. If the input words are not in the corpus, the output words are selected from the uni-grams with the highest frequencies
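The back-off loop above can be sketched in base R against a toy frequency table (the table layout mirrors the freq.table built later in this report, with "_"-separated features; all counts are illustrative):

```r
# Toy frequency table in the same layout as the report's freq.table
# ("_"-separated features). All counts here are illustrative.
freq.table <- data.frame(
  feature   = c("thank", "you", "thank_you", "thank_you_for", "thank_you_all"),
  frequency = c(50, 120, 40, 25, 10),
  ngram     = c(1, 1, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Stupid back-off: search for (n+1)-grams extending the last n input words,
# back off to shorter prefixes, then fall back to the top unigrams.
predict_sbo <- function(words, freq.table, nwords = 5) {
  for (n in seq(length(words), 1)) {
    prefix  <- paste(tail(words, n), collapse = "_")
    matches <- freq.table[freq.table$ngram == n + 1 &
                            grepl(paste0("^", prefix, "_"), freq.table$feature), ]
    if (nrow(matches) > 0) {
      matches <- matches[order(-matches$frequency), ]
      return(head(sub(".*_", "", matches$feature), nwords))
    }
  }
  # no prefix observed at any order: return the most frequent unigrams
  uni <- freq.table[freq.table$ngram == 1, ]
  head(uni$feature[order(-uni$frequency)], nwords)
}

predict_sbo(c("thank", "you"), freq.table)  # "for" "all"
```

The real implementation in the Appendix (ngram_mc_DF) follows the same search-and-back-off structure, but over the quanteda-derived frequency tables.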

[Katz Back-Off Model]

  1. Select values for the discounts at the n-gram and (n+1)-gram level
  2. Search through the frequency table and find (n+1)-grams completed by the n-gram prefix
  3. If no matching (n+1)-grams are found, the order of the input n-gram is reduced by 1 (i.e. n = n-1) and steps 2-3 are repeated until n = 1, otherwise:
    1. Calculate conditional probabilities for words that complete observed (n+1)-grams
    2. Calculate conditional probabilities for words that complete unobserved (n+1)-grams
    3. Select the terminating word of the observed/unobserved (n+1)-gram with the highest probability
  4. If the input words are not in the corpus, compute output word probabilities from the raw frequency totals of the uni-grams

See here for how conditional probabilities are calculated for an n=2 Katz Back-Off Model
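In outline, for n = 2 the observed bigram completions receive discounted probabilities, and the left-over probability mass is shared among unseen completions in proportion to their unigram frequencies. A minimal base-R sketch, where all counts and the discount d are made up for illustration:

```r
# Illustrative n = 2 Katz back-off: predicting the word after "you".
# All counts and the discount d are illustrative.
d     <- 0.5
c_you <- 100                                   # unigram count C("you")
big   <- c(you_for = 40, you_all = 10)         # observed bigram counts
uni   <- c(the = 500, "for" = 200, all = 100, so = 150)  # unigram counts

# Discounted conditional probabilities of the observed completions
p_obs <- (big - d) / c_you

# Left-over probability mass, redistributed over unseen completions
alpha <- 1 - sum(p_obs)

# Unseen completions share alpha in proportion to their unigram frequency
unseen   <- uni[!names(uni) %in% c("for", "all")]
p_unseen <- alpha * unseen / sum(unseen)

round(c(p_obs, p_unseen), 3)
# you_for you_all     the      so
#   0.395   0.095   0.392   0.118
```

Note how the probabilities sum to 1 over all candidate completions, which is what allows the model to rank unseen continuations alongside observed ones.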

Development Process

Data Pre-processing

The word prediction algorithm utilises text sourced from blogs, the news and twitter. These datasets were provided as part of the Capstone dataset. In the present analysis, only 10% of the lines were imported from each dataset.

# proportion of lines to read in
readprop <- 0.1
#document details
lang = "en_US"; source_list = c("news","blogs","twitter")

# Output file location
filepath <- paste0("data/",lang,"/modelling/"); 
if(!dir.exists(filepath)) dir.create(file.path(filepath), recursive = TRUE)

#Sample Texts
all_texts <- data.frame(matrix(ncol=5,nrow=3, 
              dimnames=list(NULL, c("text", "source", "lines", "prop", "size"))))
for (i in 1:3){
        # data location
        datdir <- paste0("data/",lang,"/",lang,".",source_list[i],".txt")
        # count total lines, then read in the first `readprop` (10%) of them
        # (countLines is from the R.utils package)
        all_texts$lines[i] <- sapply(datdir,countLines)        
        all_texts$text[i] <- paste(readLines(datdir,
                                             round(readprop*all_texts$lines[i]), 
                                             encoding="UTF-8"),collapse=" ")
        all_texts$source[i] <- source_list[i]
        all_texts$size[i] <- paste0(round(file.info(datdir)$size / 1024.0 ^ 2,2),"MB")
        all_texts$prop[i] <- readprop
}

The metadata of each text file was also extracted:

##    source   lines     size
## 1    news 1010242 196.28MB
## 2   blogs  899288 200.42MB
## 3 twitter 2360148 159.36MB

The three texts were combined into a single corpus and tagged with their source (e.g. "twitter") in the metadata. The corpus was then tokenised by word, with punctuation, symbols, numbers, URLs and separators removed.

# Create corpus
doc.corpus <- corpus(all_texts, text_field = "text")
docnames(doc.corpus) <- docvars(doc.corpus, "source")

# Tokenise corpus into words
doc.tokens <- tokens(doc.corpus,
                     what = "fasterword",
                     remove_punct = TRUE,
                     remove_symbols = TRUE,
                     remove_numbers = TRUE,
                     remove_url = TRUE,
                     remove_separators = TRUE)

As the task was to create a word prediction algorithm that excluded profanities, it was then necessary to standardise the word tokens by converting them to lower case, and then use the quanteda package tokens_select function to restrict the tokenised corpus to “clean” words from the English dictionary.

doc.tokens.lo <- tokens_tolower(doc.tokens)

if(!file.exists("data/en_US/profanity_list")){
        url <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
        download.file(url, "data/en_US/profanity_list")
}
profanity_list <- data.frame(read.csv("data/en_US/profanity_list", header = FALSE, sep = ","))

#remove profanities
doc.tokens.lo.rm <- tokens_select(doc.tokens.lo, profanity_list$V1, selection = "remove")

#Keep only english words
doc.tokens.clean <- tokens_select(doc.tokens.lo.rm, GradyAugmented, selection = "keep")

The processed corpus was then compiled into a document frequency matrix (dfm) containing n-grams for n=1,2,3,4. Separate frequency tables were then created for the complete corpus and the individual source texts. To reduce the size of the frequency tables and increase the speed of the prediction algorithms, only n-grams with a frequency of 3 or more were retained.

# Create n-grams - uni-gram to quad-gram - and convert to a DFM
doc.dfm <- dfm(tokens_ngrams(doc.tokens.clean, n=1:4))

# Create frequency tables - by source
freq.table_by_source <- tibble(textstat_frequency(doc.dfm, groups = source)) %>%
        dplyr::mutate(ngram = str_count(feature,"_")+1) %>%
        select(feature,frequency,group,ngram)
freq.table <- tibble(textstat_frequency(doc.dfm)) %>%
        dplyr::mutate(ngram = str_count(feature,"_")+1) %>%
        select(feature,frequency,ngram)

Memory usage after constructing the frequency tables (output of gc()):

##           used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells 2535809 135.5    4839162 258.5   3973498 212.3
## Vcells 4484963  34.3  100853164 769.5 125811238 959.9
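The frequency cut-off described above amounts to a simple filter on the frequency table; a base-R sketch using the same table layout (illustrative data):

```r
# Illustrative frequency table in the same layout as freq.table above
freq.table <- data.frame(
  feature   = c("the", "of_the", "one_of_the", "a_case_of_the"),
  frequency = c(5000, 800, 3, 1),
  ngram     = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Retain only n-grams observed at least 3 times
freq.table.pruned <- freq.table[freq.table$frequency >= 3, ]
nrow(freq.table.pruned)  # 3: the singleton quad-gram is dropped
```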

A summary of the n-gram frequency tables is given below for the combined text sources:

ngram     Rows
1        48733
2      1684370
3      4920678
4      6864029

and each individual text source:

ngram  group       Rows
1      blogs      40494
1      news       34714
1      twitter    30116
2      blogs     829852
2      news      685019
2      twitter   647248
3      blogs    2161041
3      news     1606232
3      twitter  1655109
4      blogs    2884951
4      news     2002044
4      twitter  2160069

From these tables, we note that blogs have the highest number of unique n-grams across all orders; however, as the table below shows, the blogs text also contains the most characters, which may partly explain this:

group    characters
blogs      87483880
news       67184048
twitter    62844900

Exploratory Analysis

In this section, the n-grams in the text corpus are explored for each source text.

Uni-grams

Wordcloud
**Figure 1**: Word cloud of 100 most frequent unigrams


As expected, typical “stopwords” are the most common uni-grams across all source texts.

Frequency Plots
**Figure 2**: Uni-gram Frequency Plots; (left) unigram frequencies for top 100 most frequent uni-grams, (right) cumulative frequency density up to 20th percentile of uni-grams.


The frequency plots also show that the source texts have a similar distribution of uni-grams: 5% of uni-grams account for nearly 90% of the total uni-gram frequency in all source texts except news, where the figure is closer to 80%.
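Coverage figures of this kind can be computed directly from the unigram frequency vector; a base-R sketch with made-up counts:

```r
# Illustrative unigram frequency counts
freqs <- c(900, 400, 200, 50, 20, 10, 5, 5, 5, 5)

# Share of total unigram frequency covered by the top p (proportion) of unigrams
coverage <- function(freqs, p) {
  freqs <- sort(freqs, decreasing = TRUE)
  k <- ceiling(p * length(freqs))
  sum(freqs[seq_len(k)]) / sum(freqs)
}

coverage(freqs, 0.20)  # top 20% of these unigrams cover 0.8125 of all tokens
```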

Bi-grams

Wordcloud
**Figure 3**: Word cloud of 100 most frequent bi-grams


While blogs and news show similar bi-grams, twitter appears to have diverged. This perhaps reflects the unique speech patterns observed on the Twitter platform.

Frequency Plots
**Figure 4**: Bi-gram Frequency Plots; (left) bi-gram frequencies for top 100 most frequent bi-grams, (right) cumulative frequency density up to 20th percentile of bi-grams.


The bi-gram frequency plots reflect this divergence: the most frequent Twitter bi-grams have notably fewer instances than the top-ranked bi-grams in the blogs and news texts. However, after this initial peak the distribution of Twitter bi-grams follows the blogs distribution quite closely.

Tri-grams

Wordcloud
**Figure 5**: Word cloud of 100 most frequent tri-grams.


Again, the blogs and news texts show very similar tri-grams, whereas twitter appears to have a completely different selection of most frequent tri-grams.

Frequency Plots
**Figure 6**: Tri-gram Frequency Plots; (left) tri-gram frequencies for top 100 most frequent tri-grams, (right) cumulative frequency density up to 20th percentile of tri-grams.


The frequency of tri-grams is notably lower than that of uni-grams and bi-grams. The news frequency distribution remains distinct from blogs and twitter, with its higher variation in text patterns (perhaps signifying more advanced/complex communication) resulting in a lower curve.

Quad-grams

Wordcloud
**Figure 7**: Quad-gram Word cloud


Blogs and news retain similar quad-grams. Twitter continues to have a notably distinct grouping.

Frequency Plots
**Figure 8**: Quad-gram Frequency Plots; (left) quad-gram frequencies for top 100 most frequent quad-grams, (right) cumulative frequency density up to 20th percentile of quad-grams.


The number of instances of the most frequent quad-grams is significantly lower than for the lower-order n-grams across all source texts. The linear relationship between the CDF and the percentile (just before the 5th percentile) corresponds to quad-grams with a single instance. This suggests that moving to higher-order n-grams (i.e. penta-grams) is unlikely to improve the performance of the prediction algorithm (for the given training dataset).

Modelling

The word prediction algorithms are tested with a set of three text inputs:

  • “Thank you for”
  • “Follow me”
  • “and a case of”

for four different training corpora:

  1. Combined texts (i.e. blogs, news and twitter)
  2. Blogs
  3. News
  4. Twitter

Thank you for

Markov Chain

results_list_mc[[1]]
##   word.Comb word.blogs word.news word.twitter
## 1       the       your      your          the
## 2      your        the   letting         your
## 3 following     making    making    following
## 4       all        all       the          all
## 5     being      being     being      sharing

Katz Back-Off

results_list_kbo[[1]]
##   word.Comb prob.Comb word.blogs prob.blogs word.news prob.news word.twitter prob.twitter
## 1       the      0.31       your       0.16      your      0.28          the         0.35
## 2      your      0.10        the       0.08   letting      0.09         your         0.09
## 3 following      0.05     making       0.06         a      0.05    following         0.06
## 4       all      0.04        all       0.05    making      0.03          all         0.04
## 5     being      0.04      being       0.03       the      0.03      sharing         0.03

Follow me

Markov Chain

results_list_mc[[2]]
##   word.Comb word.blogs word.news word.twitter
## 1         i         on        on            i
## 2       and        and     today          and
## 3        on         to      <NA>         back
## 4      back         at      <NA>       please
## 5    please        out      <NA>           on

Katz Back-Off

results_list_kbo[[2]]
##   word.Comb prob.Comb word.blogs prob.blogs word.news prob.news word.twitter prob.twitter
## 1         i      0.11         on       0.67        on      0.25            i         0.11
## 2       and      0.08        and       0.07     today      0.25          and         0.08
## 3        on      0.08         to       0.02        to      0.07         back         0.08
## 4      back      0.08         at       0.02       and      0.03       please         0.06
## 5    please      0.06        out       0.02      that      0.03           on         0.06

and a case of

Markov Chain

results_list_mc[[3]]
##   word.Comb word.blogs     word.news word.twitter
## 1       the        the        trying          the
## 2     water       this            in        water
## 3      this      shout           jan          red
## 4    trying   underway international   plagiarism
## 5        in      there          beer       carpal

Katz Back-Off

results_list_kbo[[3]]
##   word.Comb prob.Comb word.blogs prob.blogs word.news prob.news word.twitter prob.twitter
## 1       the      0.16        the       0.16       the      0.09          the         0.27
## 2     water      0.03       this       0.07         a      0.08        water         0.12
## 3      this      0.03      shout       0.02    trying      0.04          red         0.04
## 4       and      0.01   underway       0.02        in      0.04   plagiarism         0.04
## 5        an      0.01      there       0.02       jan      0.04       carpal         0.04

Model Comparisons

The outputs of the Markov Chain (w/ Stupid Back-Off) and Katz Back-Off models are almost identical for the top three predicted words. The Katz Back-Off algorithm also shows that the conditional probability of the top word is usually significantly higher than that of the bottom four words. However, the limitations of the Markov Chain model are also apparent: for the text input “Follow me”, its predictions from the news corpus are limited to a list of two, whereas the Katz Back-Off model returns five predictions because it also uses unobserved n-grams in the prediction.

Source Text Comparisons

The top five word predictions generated from the combined corpus generally differ from the predictions generated when a single source text is used. This will be a consideration for the final algorithm, as it is well known that speech patterns differ between communication platforms (e.g. “Follow me back” is a commonly used phrase on Twitter). Hence, a word prediction tool may benefit, in terms of usability, from a text corpus sourced specifically from the target platform.

App Implementation

With respect to app implementation, the following issues will be addressed:

  • Size of dataset
    • Reducing the size of the n-gram frequency tables improves app usability with respect to both prediction speed and memory requirements. Eliminating features below a cut-off frequency can significantly reduce the size of the dataset; this will be a consideration during app implementation
  • Model Selection
    • Model complexity is a key driver of prediction speed. The Markov Chain model with Stupid Back-Off is faster than the Katz Back-Off model, as it searches only one n-gram order per prediction loop rather than two and does not require probability computations. However, a simple Markov Chain model struggles, compared to the Katz Back-Off model, on training datasets where low-frequency features have been eliminated, as it does not include unobserved n-grams.
  • Accuracy
    • The model accuracy can be tested by predicting n-grams in a test dataset. This test dataset would be drawn from the 90% of lines in the source texts that were not included in the training dataset
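Such a test could be sketched as follows; predict_next here is a hypothetical stand-in for either model's prediction function (ngram_mc_DF or ngram_kbo_DF in the Appendix), and the held-out test set is illustrative:

```r
# Hypothetical stand-in predictor: always returns the same top-3 words.
# In practice this would wrap ngram_mc_DF or ngram_kbo_DF from the Appendix.
predict_next <- function(prefix) c("the", "a", "to")

# Illustrative held-out test n-grams: a prefix plus the true next word
test.set <- data.frame(
  prefix    = c("thank you for", "one of", "a case of"),
  next.word = c("the", "a", "beer"),
  stringsAsFactors = FALSE
)

# Top-3 accuracy: proportion of cases where the true word is in the top 3
hits <- mapply(function(p, w) w %in% predict_next(p),
               test.set$prefix, test.set$next.word)
top3_accuracy <- mean(hits)
top3_accuracy  # 2 of the 3 test cases are hits
```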

Appendix

# =============================        
# Dynamic Katz Back-Off Model
# ============================= 
ngram_kbo_DF <- function(text.input,freq.table,nwords,len=3){
        
        
        text.input <- iconv(text.input, from = "UTF-8", to = "ASCII", sub = "")
        text.token <- tokens(char_tolower(text.input), 
                             remove_punct = TRUE, 
                             remove_symbols = T, 
                             remove_separators = T, 
                             remove_twitter = T, 
                             remove_hyphens = T, 
                             remove_numbers = T)
        
        text.token <- as.vector(unlist(text.token))
        
        text.n <- length(text.token)
        
        nDisc = 0.5
        nloDisc = 0.5
        
        # 1.a Split frequency table by ngram
        #select maximum ngram order to inspect in frequency table
        ngram_max <- min(len+1,max(freq.table$ngram))
        ft_list <- list()
        #create list of ngram frequency tables
        for (j in 1:ngram_max){
                ft_list[[j]] <- freq.table %>% filter(ngram == j)
        }
        rm(freq.table)
        
        # Extract observed n-gram and discounted probability     
        i=0
        len <- min(text.n,len)         
        if(len>0){i = text.n-len}        
        while (0 < text.n - i){
                #convert text input into ngram format
                text.ngram <- paste0(tail(text.token,text.n-i),collapse="_")
                text.ngram.lo <- paste0(tail(text.token,text.n-i-1),collapse="_")
                
                message("Searching for ngram... ", text.ngram)
                
                # 1b. Create dynamic ngram tables
                freq.table_lo_p1 <- ft_list[[max(text.n - i + 1,1)]]
                freq.table_lo <- ft_list[[max(text.n - i,1)]]
                freq.table_lo_m1 <- ft_list[[max(text.n - i - 1,1)]]
                freq.table_uni <- ft_list[[1]]
                
                ngram_match <- freq.table_lo %>% filter(feature == text.ngram)
                ngramlo_match <- freq.table_lo_m1 %>% filter(feature == text.ngram.lo)
                # =============================
                # If word exists in corpus - Katz Back-Off Model
                # =============================        
                if(nrow(ngram_match)>0){
                        
                        # 2. find observed ngram predictions (i.e. (n+1)-gram) and compute discounted probability
                        ngram_Obs_pred <- freq.table_lo_p1 %>% filter(grepl(paste0("^",text.ngram,"_"),feature)) %>%
                                dplyr::mutate(prob.pred = ((frequency - nDisc) / ngram_match$frequency),
                                              next.word = gsub(paste0("^",text.ngram,"_"),"",feature), feature.pred = feature) 
                        # =============================        
                        # If text input is a single word - Unigram Katz Back-Off Model
                        # =============================                         
                        if (nrow(ngram_Obs_pred)>0 && text.n-i == 1){
                                ngram_pred <- unigram_KBO(freq.table_uni,text.token,text.ngram,ngram_Obs_pred)
                                return(ngram_pred[1:min(nwords,nrow(ngram_pred)), c("next.word","prob.pred")])
                                
                                # =============================        
                                # Dynamic Katz Back-Off Model
                                # =============================                            
                        } else if (nrow(ngram_Obs_pred)>0 && text.n-i > 1) {
                                
                                message("Running Dynamic Katz Back-Off Model")
        
                                # 3. find unobserved n-gram predictions i.e. possible (n+1) word of n-gram input
                                #    Need to extract unigrams not in list of predicted next word from observed ngram predictions 
                                ngram_unObs_tail <- freq.table_uni %>% filter(!(feature %in% ngram_Obs_pred$next.word))
                                # 4. Calculate discounted probability mass - weighting for unobserved n-grams
                                #    Computed from observed n-grams that form the tail (i.e. n-1) of the n-gram input
                                prob_mass <- freq.table_lo %>% filter(grepl(paste0("^",text.ngram.lo,"_"),feature)) %>%
                                        summarise(alpha = 1 - (sum(frequency - nloDisc) / ngramlo_match$frequency))
                                # 5. Calculate backed-off probabilities for n-grams
                                # 5a. Generated backed-off n-grams using the unobserved (n+1)-gram tails - see (3)
                                ngram_Bo <-ngram_unObs_tail %>% dplyr::mutate(feature.BO.ngram = paste(text.ngram.lo, feature, sep = "_"))
                                # 5b. Extract observed frequencies of backed-off n-grams
                                ngram_Bo_Obs <- freq.table_lo %>% filter(feature %in% ngram_Bo$feature.BO.ngram,
                                                                         str_count(feature,"_") > 0)
                                # 5c. Identify unobserved backed-off n-grams - using 5b
                                ngram_Bo_unObs <- ngram_Bo %>% filter(!(feature.BO.ngram %in% ngram_Bo_Obs$feature)) %>% 
                                        dplyr::mutate(feature =  feature.BO.ngram) %>% select(-feature.BO.ngram)
                                # 5d. Generate probabilities of observed backed-off n-grams
                                ngram_Bo_Obs <- ngram_Bo_Obs %>% dplyr::mutate(prob  = ifelse(frequency>0,(frequency - nloDisc) / ngramlo_match$frequency,0))
                                # 5e. Generate probabilities of unobserved backed-off n-grams 
                                ngram_Bo_unObs <-ngram_Bo_unObs %>% dplyr::mutate(prob = prob_mass$alpha * frequency / sum(frequency) )
                                # Combine observed and unobserved n-grams
                                ngram_Bo <- rbind(ngram_Bo_Obs,ngram_Bo_unObs)
                                # 6. Calculate (n+1)-gram probability discount mass - using observed (n+1)-gram probabilities
                                prob_mass <- prob_mass %>% dplyr::mutate(alpha2 = 1 - sum(ngram_Obs_pred$prob.pred))
                                # 7. Calculate unobserved backed-off (n+1)-gram probabilities
                                ngram_Bo_unObs_pred <- ngram_Bo %>% dplyr::mutate(feature.pred = paste(text.token[1],feature,sep="_"),
                                                                                  prob.pred = prob_mass$alpha2*prob / sum(prob))
                                # 8. Select prediction with highest probability
                                # Clean data frames
                                ngram_Bo_unObs_pred <- ngram_Bo_unObs_pred %>% dplyr::mutate(next.word = sub(".*_", "",feature.pred)) %>%
                                        select(feature,frequency,feature.pred,prob.pred,next.word)
                                ngram_pred <- ngram_Obs_pred %>% select(feature,frequency,feature.pred,prob.pred,next.word) %>% 
                                        rbind(ngram_Bo_unObs_pred) %>% arrange(desc(prob.pred))
                                return(ngram_pred[1:min(nwords,nrow(ngram_pred)), c("next.word","prob.pred")])
                        }
                }
                i = i + 1
                message(text.n - i)
        }
        message("Running Unigram Model")                                
        ngram_pred <- unigram_pred(freq.table_uni,text.token)
        return(ngram_pred[1:min(nwords,nrow(ngram_pred)), c("next.word","prob.pred")])        
        
}

# =============================        
# Unigram Katz Back-Off Model
# ============================= 
unigram_KBO <- function(freq.table,text.input,text.ngram,ngram_Obs_pred){
        prob_mass<- data.frame(alpha= 1 - sum(ngram_Obs_pred$prob.pred))
        ngram_Bo_unObs <- freq.table %>% filter(ngram == 1, !(feature %in% ngram_Obs_pred$next.word)) %>% 
                mutate(prob.pred = prob_mass$alpha * frequency / sum(frequency),
                       next.word = feature, feature.pred = paste(feature,text.ngram,sep="_")) %>%
                select(feature,frequency,feature.pred,prob.pred,next.word)
        ngram_pred <- ngram_Obs_pred %>% select(feature,frequency,feature.pred,prob.pred,next.word) %>%
                rbind(ngram_Bo_unObs) %>% arrange(desc(prob.pred))
        return(ngram_pred)
}

# =============================        
# Generate prediction from most frequent words in corpus
# =============================  
unigram_pred <- function(freq.table,text.input){
        ngram_pred <- freq.table %>% filter(ngram == 1) %>% mutate(prob.pred = frequency / sum(frequency),
                                                           feature.pred = paste(text.input,feature,sep="_"),
                                                           next.word = feature) %>%
        select(feature,frequency,feature.pred,prob.pred,next.word) %>% arrange(desc(prob.pred))
        
        return(ngram_pred)
}

# =============================        
# Markov Chain Model
# ============================= 
ngram_mc_DF <- function(text.input,freq.table,nwords,len=3){
        
        
        text.input <- iconv(text.input, from = "UTF-8", to = "ASCII", sub = "")
        text.token <- tokens(char_tolower(text.input), 
                             remove_punct = TRUE, 
                             remove_symbols = T, 
                             remove_separators = T, 
                             remove_twitter = T, 
                             remove_hyphens = T, 
                             remove_numbers = T)
        
        text.token <- as.vector(unlist(text.token))
        
        text.n <- length(text.token)
        
        nDisc = 0.5
        nloDisc = 0.5
        
        # 1.a Split frequency table by ngram
        #select maximum ngram order to inspect in frequency table
        ngram_max <- min(len+1,max(freq.table$ngram))
        ft_list <- list()
        #create list of ngram frequency tables
        for (j in 1:ngram_max){
                ft_list[[j]] <- freq.table %>% filter(ngram == j)
        }
        rm(freq.table)
        
        # Extract observed n-gram and discounted probability
        i=0
        len <- min(text.n,len)        
        if(len>0){i = text.n-len}
        while (0 < text.n - i){
                #convert text input into ngram format
                text.ngram <- paste0(tail(text.token,text.n-i),collapse="_")
                
                message("Searching for ngram... ", text.ngram)
                
                ngram_match <- ft_list[[max(text.n - i + 1,1)]] %>% filter(grepl(paste0("^",text.ngram,"_"),feature))
                
                # =============================
                # If n-gram exists in corpus - rank matching (n+1)-grams by frequency
                # =============================        
                if(nrow(ngram_match)>0){
                        ngram_pred <- ngram_match %>% mutate(next.word = gsub(paste0("^",text.ngram,"_"),"",feature)) %>%
                                arrange(desc(frequency))
                        return(ngram_pred[1:min(nwords,nrow(ngram_pred)), "next.word"])  
                }
                i = i + 1
        }
        message("Running Unigram Model")
        # use the unigram table from ft_list (freq.table_uni is not defined in this function)
        ngram_pred <- unigram_pred(ft_list[[1]],text.token)
        return(ngram_pred[1:min(nwords,nrow(ngram_pred)), "next.word"])        
        
}